
Introducing EMO: AI Tool for Converting Photos to Videos

Researchers Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo from Alibaba Group's Institute for Intelligent Computing have introduced EMO, an innovative artificial intelligence tool.

EMO is designed to transform photos into videos, enabling the people depicted in these photos to speak and sing in any chosen voice.

This AI tool can read a selected text aloud and fluidly alter the facial expressions in the photo to match its content, producing a seamless, lifelike result.


Mouth movements change in accordance with the words

The most notable feature of EMO is not merely its ability to make photos or images appear to speak, a capability already offered by numerous other applications.

What sets this AI tool apart is that it animates a visual in response to arbitrary audio rather than a predefined set of clips. Moreover, it accurately synchronizes mouth movements with the spoken words, effectively converting a still image into a video that aligns with the accompanying sound.

Another significant aspect of this artificial intelligence tool is its ability to adapt the animation's tempo to the audio source.

The AI discerns the difference between calm speech and rapid-fire rapping, adjusting the tempo of gestures, facial expressions, and mouth movements in the animation to match. Impressively, this AI can also bring to life animated characters, AI-generated images, or anime characters, enabling them to speak in a synchronized manner with the sound.
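To make the tempo claim concrete, here is a minimal sketch of how pacing could, in principle, be read off a soundtrack. This is not EMO's published method; the function name `estimate_speech_tempo`, the sample rate, and the onsets-per-second heuristic are all assumptions, illustrated with librosa's standard onset detector.

```python
# Hypothetical illustration only: estimate speaking pace from audio so a
# generator could scale gesture and mouth-movement speed accordingly.
# This is NOT EMO's published method.

import librosa

def estimate_speech_tempo(audio_path: str) -> float:
    """Return a rough rate of syllable-like onsets per second."""
    y, sr = librosa.load(audio_path, sr=16000)              # mono waveform
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    duration = len(y) / sr
    return len(onsets) / max(duration, 1e-6)                # onsets per second

# Calm narration might land around 2-4 onsets/s, while rapid rapping can
# exceed that severalfold; such a signal could drive animation tempo.
rate = estimate_speech_tempo("vocals.wav")
print(f"estimated onset rate: {rate:.1f}/s")
```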


So how does it work?


The researchers have disclosed that, at its core, the artificial intelligence model comprises two primary components. The first analyzes the reference image and generates the moving frames from it.

The second processes the audio file and identifies its crucial points, which are then aligned with the visuals. In addition, the AI is equipped with two control modules.

One keeps the character's appearance consistent across frames, while the other governs the audio-driven aspects of the motion. The outputs of both are then seamlessly integrated to produce the final video.
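Since the researchers have not released EMO's code, the following is only a minimal sketch of the two-branch layout described above. Every name in it (`EmoLikePipeline`, `reference_branch`, `audio_branch`, `identity_control`, `audio_control`, `fuse`) is hypothetical, and the layers are simple stand-ins chosen to show the data flow: one branch encodes the reference image, one encodes the audio, two control modules shape each stream, and an attention step aligns the audio's key points with the visuals.

```python
# Hypothetical sketch of the described two-branch design; not EMO's code.

import torch
import torch.nn as nn

class EmoLikePipeline(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Branch 1: encodes the reference photo into visual features.
        self.reference_branch = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4), nn.GELU()
        )
        # Branch 2: encodes the audio (e.g., 80-bin mel features) over time.
        self.audio_branch = nn.GRU(input_size=80, hidden_size=dim, batch_first=True)
        # Control module 1: keeps the character's appearance consistent.
        self.identity_control = nn.Linear(dim, dim)
        # Control module 2: shapes how the audio drives the motion.
        self.audio_control = nn.Linear(dim, dim)
        # Cross-attention aligns audio key points with the visual features.
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, image: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) reference photo; mel: (B, T, 80) audio features
        vis = self.reference_branch(image).flatten(2).transpose(1, 2)  # (B, N, dim)
        aud, _ = self.audio_branch(mel)                                # (B, T, dim)
        vis = self.identity_control(vis)    # enforce appearance consistency
        aud = self.audio_control(aud)       # shape audio-driven dynamics
        frames, _ = self.fuse(aud, vis, vis)  # audio queries attend to visuals
        return frames  # (B, T, dim): one motion latent per audio step

# Toy usage: one 256x256 photo plus 100 mel frames -> 100 motion latents.
# A real system would decode these latents into actual video frames.
model = EmoLikePipeline()
out = model(torch.randn(1, 3, 256, 256), torch.randn(1, 100, 80))
print(out.shape)  # torch.Size([1, 100, 256])
```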

